Dummy regression to predict dry fiber in Agave lechuguilla Torr. in two large-scale bioclimatic regions in Mexico

Agave lechuguilla Torr., of the family Agavaceae, is distributed from southwestern United States to southern Mexico and is one of the most representative species of arid and semiarid regions. Its fiber is extracted for multiple purposes. The objective of this study was to generate a robust model to predict dry fiber yield (Dfw) rapidly, simply, and inexpensively. We used a power model in its linear form and bioclimatic areas as dummy variables. Training, generation (80%) and validation (20%) of the model was performed using machine learning with the package ‘caret’ of R. Using canonical correlation analysis (CCA), we evaluated the relationship of Dwf to bioclimatic variables. The principal components analysis (PCA) generated two bioclimatic zones, each with different A. lechuguilla productivities. We evaluated 499 individuals in four states of Mexico. The crown diameter (Cd) of this species adequately predicts its fiber dry weight (R2 = 0.6327; p < 0.05). The intercept (β0), slope [lnCd (β1)], zone [(β2)] and interaction [lnCd:Zona (β3)] of the dummy model was statistically significant (p < 0.05), giving origin to an equation for each bioclimatic zone. The CCA indicates a positive correlation between minimum temperature of the coldest month (Bio 6) and Dwf (r = 0.84 and p < 0.05). In conclusion, because of the decrease in Bio 6 of more than 0.5°C by 2050, the species could be vulnerable to climate change, and A. lechuguilla fiber production could be affected gradually in the coming years.


Introduction
In Mexico, arid and semiarid regions cover 54% of the territory and are inhabited by approximately 40% of the country's population [1]. Their vegetation is scrub Crassicaule, Desert Microphyll, and Desert Rosetophile; the genera Larrea, Agave, Dasylirion, Yuca, etc., are the most representative [2]. One of the most economically and ecologically important species in arid regions is Agave lechuguilla Torr., a succulent plant of the Agavaceae family, distributed from southeastern United States to southern Mexico over 11 federal states [3,4], covering 142 115 km 2 [5].
This species, known as "lechuguilla", is outstanding for its fiber ("ixtle"), which is extracted from its bud ("cogollo") once the plant has reached 25 cm in height [6]. The fiber is used in the automobile industry, cordage, rugs, and cleaning brushes, among others [7]. Lechuguilla fiber can be extracted manually or mechanically [3,8]. The former method produces better quality and can sell for a price of up to 1.2 US dollars per kg. With this technique, 1.87 kg h -1 is produced, while the mechanical method can produce between 15 and 18 kg h -1 of lower quality fiber [9]. Better quality fiber is obtained from bud ("cogollos") with young leaves, which contain less lignin than leaves from the plant crown [8]. On lechuguilla plantations, the bud takes seven to eight months to regenerate (to reach � 25 cm of height), while naturally it takes between 16 and 24 months [10].
The fiber from A. lechuguilla is crucial for inhabitants of arid and semiarid regions of the country [11], benefitting nearly 52,000 families [7]. For this reason, it is necessary to quantify fiber existences. This can be done directly by harvesting the plant or indirectly using allometric models. In Mexico there are only three studies on estimating A. lechuguilla fiber dry weight: Blando & Baca [12], Pando et al. [13] and Velasco et al. [3]. All these studies were conducted at a local scale, and some of these models require harvesting the entire plant or part of it [8,14] to estimate fiber weight. Since the objective of creating a model is to make predictions with new data, it is necessary that the model comply with all the statistical assumptions, and it is especially important to evaluate its predictive capacity [15].
For this reason, it is crucial to generate robust allometric models to estimate A. lechuguilla fiber dry weight with non-destructive techniques that are simple, rapid, and efficient, and that solve the problem of scale, considering the productivity areas and determining the relation of fiber dry weight to bioclimatic variables. The objective of this study was to create a model of this type that will serve not only to support government procedures [16] for authorizing programs for use of the fiber but also for decision-making in management and conservation of the species.

Description of the study area
The study was conducted in the arid and semiarid regions in northeastern Mexico in the states of Chihuahua (Chih), Coahuila (Coah), San Luis Potosí (SLP) and Zacatecas (Zac). Agave lechuguilla grows in colluvial, sandy loam, limestone and clay soils [7,17] in areas where precipitation varies from 150 to 500 mm, at altitudes from 200 to 2400 m [2,11] and temperatures of 3 to 30˚C [18], and even in extreme temperatures of -8 to 44˚C. It is known that this plant can withstand droughts and floods [19]. minimum temperature of the coldest month, Bio 6 (˚C), and mean annual precipitation, Bio 12 (mm) in raster format at a resolution of 1 km 2 [20]. Using ArcMap 10.4.1, 1000 points were distributed randomly over the area of A. lechuguilla distribution in the country, for which the values of each bioclimatic variable and of altitude (m) obtained from the Modelo Digital de Elevación Mexicano (https://www.inegi.org.mx/app/geo2/elevacionesmex/) were extracted. With this information, a principal components analysis (PCA) was performed in R software [21] using the library 'FactoShiny' [22] which requires 'FactoMiner' [23], using the variables in standardized form to extract the first three principal components. Then, with the unrotated eigenvalues derived from PC1, an interpolation (Inverse Distance Weighted) was carried out in ArcMap, from which isolines spaced at 2.85 were derived.

Sampling Agave lechuguilla Torr for fiber extraction
Agave lechuguilla sampling was selective only in "ejidos" with gathering permits. therefore, a special permit was not necessary to harvest the individuals in this study.
Plants � 10 cm tall were collected (according to the gatherers, it is not until the plant reaches this height that the bud can be worked traditionally). Using a 3 m tape measure, we measured average crown diameter (Cd, cm) and total height (H, cm) on each selected plant. Prior to felling the plant (to obtain aerial biomass), the bud was extracted using a "cogollera" (a rustic instrument characteristic of the gatherers) to measure its length (Lc) and immediately extract the fiber manually. The fresh fiber was placed in a paper bag properly labeled and taken to the laboratory at the UAAAN to be dried in a Thermo Scientific TM HERAthem TM (Modelo OMH750) oven at 70˚C until it had a constant weight, which was obtained with a Torrey1 (Modelo L-EQ) scale with a capacity of 5 kg and precision of 1 g.

Generation and validation of the equation for estimating dry fiber weight
To predict dry fiber (Dfw) of individual A. lechuguilla plants, the allometric power equation (Eq 1) was tested in its linear form (Eq 2), applying logarithms to correct error variance [15], as used by Wood [24], Zárate et al. [25] and Flores et al. [26].
where Y = fiber dry weight (kg), testing crown diameter (Cd, cm), total plant height (H, cm), and the product of the two, Cd×H (cm×cm), as predictor variables, ln = natural logarithm and β 0 and β 1 = coefficients of regression.
Eq 2 was fit using the bioclimatic zone as the dummy variable D ¼ ing an effect to the intercept (Zone) and the model slope (X: Zone) to determine the existence of a model (for the entire area) or one for each bioclimatic zone, as follows: With 80% of all the data (obtained randomly by quantiles) and using the method of ordinary least squares [15], the model was trained using machine learning (ML) with the package R 'caret' [27]. With the same library, the model was then generated by bootstrap (n = 25). Possible atypical data (−3�r i �+3) were examined through studentized residuals r i . Influence on the coefficients of the model, a) Cook distance, D i (observation y i ), and on the precision of COVRATIO estimations [15,28,29], and b) DFBETAS (b i ). DFFITS (influence of the i th observation on the predicted values), and on the precision of the estimations. The total of the basic hypotheses of the regression model is summarized in the following expression: ε i~N (0, σ 2 ), i = 1,. . .n; that is, random, independent, and identically distributed errors, according to a normal distribution with mean zero and variance σ 2 .
After verifying compliance to all the previous assumptions, validation of the final model was carried out in 'caret' [27] with 20% of the independent data, using four methods: a) Leave One Out Cross-Validation, b) k-fold validation (k = 10), c) Repeated K-fold (k = 10 & 3 sets) and d) Bootstrap Cross-Validation, calculating the Coefficient of Determination (R 2 ), Root Mean Square Error (RMSE) and Mean Absolute Error (MAE).
The return of ln to its original units to predict A. lechuguilla fiber dry weight of the resulting equation is not direct since, if the distribution of a log Y is normal, a log X, result of the antilogarithm log Y, may be not normal and biased since the median is obtained and not the mean [24]. This would have to be corrected by applying to the final model a correction factor (calculated here), which is given by: , where SEE is the estimation error of the regression [30].

Canonical correlation analysis
With the aim of identifying the dependence of A. lechuguilla dry fiber with environmental variables, as well as the interdependence between factors (matrices X, Y): [Cd, H and Dfw], [Bio 1, Bio 5, Bio 6, Bio 12 and altitude], a Canonical Correlation Analysis was performed using the package R 'CCA' [31] of R statistical software [21]. The matrices X, Y were generated with the average of the variable per municipality (S1 Table).
Using the same variables, plus slope, aspect and records of the presence of the species, some authors [32,33] have generated zones of productive potential in this region of the country, estimating between 9 and 5 million ha that are ideal for A. lechuguilla plantations, mainly in zone 2 of our study (Fig 1B). Flores et al. [26] constructed models to estimate aboveground biomass of this species, considering the state boundaries (Fig 1) as the dummy variable (or zone) and found differences in biomass among the zones.
When we examined the means of the dendrometric (Cd, H and Dfw) and bioclimatic (Bio 1, Bio 5, Bio 6, Bio 12 and altitude) variables through dummy-regression models equivalent to Student "t" tests) in logarithmic form for compliance to variance normality and homogeneity: ln(Y) = β 0 +β 1 ×Zone, we found that there were statistically significant differences among zones (p � 0.05). In Zone 2, A. lechuguilla height, crown diameter, and quantity of fiber were higher than in Zone 1. This appears to obey basically minimum temperature (Bio 6) and, to a lesser degree, precipitation (Bio 12), which are also the highest (Fig 2, S2 Table) in this zone. There was no proof for rejecting the H 0 that mean annual precipitation and altitude are the same among zones (p = 0.956 and p = 0.106).

Characterization of the Agave lechuguilla Torr. sample
The total number of plants evaluated was 499, distributed geographically over a distance of more than 1000 km, according to   The plants were obtained from 20 municipalities and 46 ejidos (S1 Appendix 1). It can be seen that A. lechuguilla individuals are taller (52.76 cm) and crown diameters are larger (66.10 cm) in Zone 2 ( Table 1, Fig 1), and therefore, they accumulate on average a larger amount of fiber in their plant tissues (22.21 g plant -1 ), but up to maximums of 34.15 and 46.60 g plant -1 , according to percentile 95 (Zone 1 and Zone 2). In general, to predict A. lechuguilla dry fiber, a selective sampling is applied, considering a lower limit for plant height; for example, in Zone 2 of this study, Blando and Baca [12] selected plants � 45 cm tall finding that the accumulation of dry fiber can be between 45.6 and 70.5 g plant -1 , much higher than the average of our study. We included plants � 10 cm tall, and for this reason, when averages (or even medians) are compared with previous studies, the differences are contrasting.

Analysis of predictors of Agave lechuguilla Torr. dry fiber
In a preliminary analysis Eq 2 was adjusted with each predictor (Cd, H, Cd×H). Although all show the capacity to predict A. lechuguilla dry fiber (0.519 < R 2 < 0.633 and p-value< 2e-16), only Cd passes all the assumptions of a linear model (Table 2). For this reason, Cd was selected as the predictor of A. lechuguilla dry fiber. It has been demonstrated that the best predictors in this species are Cd and H, e.g. Valencia et al. [37] used Cd (cm) and plant volume (m 3 ) to predict plant fresh weight (kg), with correlations of 0.9272 and 0.9433, respectively. Flores et al. [26] show that Cd and H efficiently predict aboveground biomass, explaining 91.4%. With destructive methods, it has been proven that heart volume (cm 3 ) can explain from 57.9% [7,25] to 90.98% [11] of the heart fresh weight. The moisture index (MPa) and the index of photosynthetically active radiation (mmol m 2 �S -1 ) explain 97% and 25% of the biomass of this species [38].
In addition, statistically significant (p < 0.001) results were obtained with the hypothesis tests on the coefficients of regression with different sample sizes (Table 3): intercept (β 0 ), slope [lnCd (β 1 )], dummy variable [Zone (β 2 )] and interaction [lnCd:Zone (β 3 )] of Eq 3. This demonstrates correct bioclimatic zoning of A. lechuguilla productivity with the PCA and shows that the predictor (lnCd) of dry fiber does not originate from a random model.

Model for predicting Agave lechuguilla Torr. dry fiber
The final model obtained with ML showed an R 2 of 0.629. The coefficients of regression were highly significant, p<0.05 (Table 4), similar to the average obtained with different sample sizes ( Table 3). The dummy variable (Zone) and the interaction lnCd:Zone show p values of 0.0004 and 0.0014 (Table 4), giving way to an equation for each bioclimatic zone (Fig 3A) for estimating A. lechuguilla dry fiber.

PLOS ONE
Dummy regression to predict dry fiber in Agave lechuguilla Because of the wide variation in fiber observed in this species (Table 1), in the regression analysis we present some influential observations (Fig 3B) on the model coefficients: D i (DFFITS), but none with D i >1 that could be eliminated. There were no influential observations regarding DFBETAS. With COVRATIO, several observations were outstanding as influential, but we underline that no atypical r i data were detected. The correction (1.127) and multiplicative factor obtained at the final equation can correct the bias and improve the estimations of Dfw by 12.7%.

PLOS ONE
Dummy regression to predict dry fiber in Agave lechuguilla

Model capacity for predicting dry fiber content of Agave lechuguilla
Torr.
Model validation using machine learning, with 100 independent data (data that the model would never have "seen") showed that the model exhibits good capacity for predicting A. lechuguilla fiber dry weight. The fit statistics (all in logarithmic units) in the four methods of validation (R 2 ) are clearly lower (e. g. average R 2 of 0.544) than those resulting from the set of training data (R 2 = 0.629), but very similar to each other (Table 5). This guarantees that, by using the generated equations to predict A. lechuguilla dry fiber with data that the model has never seen, the estimations are adequate. According to the literature, using independent data to validate a model is one of the most recommended techniques [39,40]. Using a set of independent data and applying the model generated with the training set, Velasco et al. [3] evaluated its capacity to predict dry fiber in this species and found that the fit statistics and predictions are very similar, assuring robustness when predicting with new data.
In similar studies conducted in Zone 2 (Fig 1) with n = 287, Blando and Baca [12] found that Cd and number of usable leaves explained 76.09% of the A. lechuguilla fiber. Using 95 individuals (H � 25 cm) in Zone 2 of this study, Pando et al. [13] demonstrated that H and Cd (as a single independent variable) explained 86.9% of A. lechuguilla fiber dry weight, with an error of 5.041 g. In the same Zone (n = 240) Velasco et al. [3] found that the diameter and height of the bud explained 68% of the fiber in A. lechuguilla. It is evident that the fit statistics of each model differ from that of our study and from each other due to the sample size (n), sampling method (directed, selective, quadrants), plant selection (minimum height),  geographic location of the study (zone 1 or 2), model type (linear, non-linear, simple or multiple), type of predictor and even compliance or non-compliance with the model assumptions.
Although linear models have been generated to estimate fiber weight in A. lechuguilla [12,13], the allometry reveals exponential-type changes in dimensions relative to the parts, e. g. in Agave salmiana Otto ex. Salm. ssp. crassispina (Trel.) Gentry aboveground biomass follows this pattern [41], as in Agave salmiana ssp. Crassispina [42], and as has been demonstrated for prediction of aboveground biomass in A. lechuguilla [26] and dry fiber in our study, applying the power equation (Eq 1) in its linear form (Eq 2).

Canonical relationship between A. lechuguilla dry fiber and bioclimatic variables
According to the first and second canonical variable, the degree of correlation between dendrometric variables (matrix Y) is higher (r = 0.9000) than that between bioclimatic variables (matrix X), which is r = 0.7345 (Fig 5A). The correlation in the first three dimensions is 0.8101, 0.5395 and 0.2510, respectively. Crossed canonical correlation shows that the minimum temperature of the coldest month, Bio 6, and annual precipitation, Bio 12 ( Fig 5A, 5B and Table 6), correlated positively with A. lechuguilla dry fiber production in these bioclimatic zones. According to the three models of general circulation, CCSM4, HadGEM2-AO and MIROC5 [20], in a scenario of moderate CO 2 emissions (4.5 W/m 2 ), it is expected that by 2050, Bio 6 will decrease 0.56˚C in the area of A. lechuguilla distribution, possibly affecting the species' fiber production, while Bio 12, will decrease only 1.79%.

Conclusions
The principal components analysis generated two bioclimatic zones, in Zone 2 (located in the eastern part of the study area) A. lechuguilla productivity is higher than in Zone 1. This zoning  is useful when focusing efforts specifically on management and use of this species. The best predictor of dry fiber content is crown diameter (Cd). Using dummy regression, it was possible to establish an equation for each bioclimatic zone (Fig 1B). Validation of this equation revealed high predictive capacity of the model, which is also easy, rapid, and inexpensive to use and is useful in large part of the known distribution area of the species. The results suggest that the species could be vulnerable to climate change with a decrease in Bio 6 of more than 0.5˚C by 2050 and there would likely be a gradual reduction in A. lechuguilla fiber production over the coming years.
Supporting information S1